{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Use Caffe On TH-2 DEMO\n",
    "\n",
    "> 演示如何在天河上使用Caffe   2016/5/20 by : lijiang\n",
    "\n",
    "*Caffe 是目前流行的深度学习框架，前几天才在天河上进行了部署，今天因有用户询问使用方法做此说明文档。*\n",
    "\n",
    "*Caffe 目前的代码没有针对 CPU 进行优化，更适合使用 GPU 来进行计算。但有客户在对CAFFE 进行 CPU 的优化，以后会在TH-2 上进行发布 *"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": []
    }
   ],
   "source": [
    "alias module='source ~/.module.sh'\n",
    "export MODULEPATH=/NSFCGZ/app/modulefiles"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> 这两行代码只是为了我配置演示环境，一般用户并不需要"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. 加载环境\n",
    "使用 module load 加载 caffe 环境 ； 首先，在不了解系统时可以通过 module avail 查看有哪些可用版本 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r\n",
      "--------------------------- /NSFCGZ/app/modulefiles ----------------------------\r\n",
      "caffe/v20160510-cpu3    caffe/v20160511-gpu-icc\r\n"
     ]
    }
   ],
   "source": [
    "module avail caffe"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里可以看到，在 基金用户的分区 （/NSFCGZ） 目前部署了两个版本的 caffe , 分别对应使用 CPU 和 GPU .本例演示使用CPU 的版本。\n",
    "\n",
    "> 如果是在前面的WORK 节点，可能能看到更多的， 但有一些并不推荐使用。\n",
    "\n",
    "使用module load 加载环境."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/NSFCGZ/app/caffe/v20160510-cpu3/bin/caffe\r\n"
     ]
    }
   ],
   "source": [
    "module load caffe/v20160510-cpu3\n",
    "which caffe "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里可以看到，caffe 命令及其环境已经成功加载了。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. 数据处理\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "bin  caffe-master  include  lib  python  share\r\n"
     ]
    }
   ],
   "source": [
    "ls /NSFCGZ/app/caffe/v20160510-cpu3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "用户可以自己上传 caffe-master 文件夹，也可以直接从 /NSFCGZ/app/caffe/v20160510-cpu3/caffe-master  复制一份"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CMakeLists.txt\t Makefile\t\t  caffe.cloc  examples\tscripts\r\n",
      "CONTRIBUTING.md  Makefile.config\t  cmake       include\tsrc\r\n",
      "CONTRIBUTORS.md  Makefile.config.example  data\t      matlab\ttools\r\n",
      "INSTALL.md\t README.md\t\t  docker      models\r\n",
      "LICENSE\t\t build\t\t\t  docs\t      python\r\n"
     ]
    }
   ],
   "source": [
    "cp -r /NSFCGZ/app/caffe/v20160510-cpu3/caffe-master ~/.\n",
    "cd ~/caffe-master \n",
    "ls"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我原来已经下载了 算例 cifar10 的数据； 但为了完整演示，这里使用算例 mnist"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "get_mnist.sh\r\n"
     ]
    }
   ],
   "source": [
    "ls data/mnist "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以看到： 这个文件夹里面只有一个脚本文件；如果按在自己电脑上的方法去执行则会出错：\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading...\r\n",
      "--2016-05-20 15:58:24--  http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz\r\n",
      "Resolving yann.lecun.com... failed: Temporary failure in name resolution.\r\n",
      "wget: unable to resolve host address `yann.lecun.com'\r\n",
      "gzip: train-images-idx3-ubyte.gz: No such file or directory\r\n",
      "--2016-05-20 15:58:24--  http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz\r\n",
      "Resolving yann.lecun.com... failed: Temporary failure in name resolution.\r\n",
      "wget: unable to resolve host address `yann.lecun.com'\r\n",
      "gzip: train-labels-idx1-ubyte.gz: No such file or directory\r\n",
      "--2016-05-20 15:58:24--  http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz\r\n",
      "Resolving yann.lecun.com... failed: Temporary failure in name resolution.\r\n",
      "wget: unable to resolve host address `yann.lecun.com'\r\n",
      "gzip: t10k-images-idx3-ubyte.gz: No such file or directory\r\n",
      "--2016-05-20 15:58:24--  http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz\r\n",
      "Resolving yann.lecun.com... failed: Temporary failure in name resolution.\r\n",
      "wget: unable to resolve host address `yann.lecun.com'\r\n",
      "gzip: t10k-labels-idx1-ubyte.gz: No such file or directory\r\n"
     ]
    }
   ],
   "source": [
    "data/mnist/get_mnist.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "原因很简单，机器是不联网的。。。我们需要手工将压缩包或者数据文件放过来。\n",
    "这里可以从 /NSFCGZ/app/share 拷贝一份过来。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": []
    }
   ],
   "source": [
    "cp /NSFCGZ/app/share/mnist_data.tar.gz data/mnist/."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "然后修改一下 data/mnist/get_mnist.sh 这个脚本 \n",
    "（这里没法演示修改的过程；大家自己修改 , 这里用 diff 查看修改前后的对比 ， 后面如有文件修改同此）："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "8c8\r\n",
      "< \r\n",
      "---\r\n",
      "> tar xvf *.tar.gz\r\n",
      "12c12\r\n",
      "<         wget --no-check-certificate http://yann.lecun.com/exdb/mnist/${fname}.gz\r\n",
      "---\r\n",
      ">         #wget --no-check-certificate http://yann.lecun.com/exdb/mnist/${fname}.gz\r\n"
     ]
    }
   ],
   "source": [
    "diff data/mnist/get_mnist.sh data/mnist/get_mnist-th.sh "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading...\r\n",
      "train-images-idx3-ubyte.gz\r\n",
      "train-labels-idx1-ubyte.gz\r\n",
      "t10k-images-idx3-ubyte.gz\r\n",
      "t10k-labels-idx1-ubyte.gz\r\n",
      "get_mnist-th.sh   mnist_data.tar.gz\t  train-images-idx3-ubyte\r\n",
      "get_mnist.sh\t  t10k-images-idx3-ubyte  train-labels-idx1-ubyte\r\n",
      "get_mnist.sh-org  t10k-labels-idx1-ubyte\r\n"
     ]
    }
   ],
   "source": [
    "data/mnist/get_mnist-th.sh \n",
    "ls data/mnist/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "执行完脚本后可以发现数据已经就绪了。进行下一步操作。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Creating lmdb...\r\n",
      "examples/mnist/create_mnist.sh: line 16: build/examples/mnist/convert_mnist_data.bin: Permission denied\r\n",
      "examples/mnist/create_mnist.sh: line 18: build/examples/mnist/convert_mnist_data.bin: Permission denied\r\n",
      "Done.\r\n"
     ]
    }
   ],
   "source": [
    "examples/mnist/create_mnist.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里执行会出错，同样需要稍微修改下文件 ：\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "16c16,17\r\n",
      "< $BUILD/convert_mnist_data.bin $DATA/train-images-idx3-ubyte \\\r\n",
      "---\r\n",
      "> #$BUILD/\r\n",
      "> convert_mnist_data $DATA/train-images-idx3-ubyte \\\r\n",
      "18c19,20\r\n",
      "< $BUILD/convert_mnist_data.bin $DATA/t10k-images-idx3-ubyte \\\r\n",
      "---\r\n",
      "> #$BUILD/\r\n",
      "> convert_mnist_data $DATA/t10k-images-idx3-ubyte \\\r\n"
     ]
    }
   ],
   "source": [
    "diff examples/mnist/create_mnist.sh  examples/mnist/create_mnist_th.sh "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Creating lmdb...\r\n",
      "I0520 16:10:11.905948 10730 db_lmdb.cpp:35] Opened lmdb examples/mnist/mnist_train_lmdb\r\n",
      "I0520 16:10:11.914786 10730 convert_mnist_data.cpp:88] A total of 60000 items.\r\n",
      "I0520 16:10:11.914836 10730 convert_mnist_data.cpp:89] Rows: 28 Cols: 28\r\n",
      "I0520 16:10:13.172996 10730 convert_mnist_data.cpp:108] Processed 60000 files.\r\n",
      "I0520 16:10:13.472939 10737 db_lmdb.cpp:35] Opened lmdb examples/mnist/mnist_test_lmdb\r\n",
      "I0520 16:10:13.474231 10737 convert_mnist_data.cpp:88] A total of 10000 items.\r\n",
      "I0520 16:10:13.474262 10737 convert_mnist_data.cpp:89] Rows: 28 Cols: 28\r\n",
      "I0520 16:10:13.711577 10737 convert_mnist_data.cpp:108] Processed 10000 files.\r\n",
      "Done.\r\n"
     ]
    }
   ],
   "source": [
    "examples/mnist/create_mnist_th.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "执行修改后的文件，一切顺利"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 计算\n",
    "\n",
    "本例要演示的计算文件为 examples/mnist/train_lenet.sh ； 同样需要略作修改 ："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3c3,4\r\n",
      "< ./build/tools/caffe train --solver=examples/mnist/lenet_solver.prototxt\r\n",
      "---\r\n",
      "> #./build/tools/\r\n",
      "> yhrun -p nsfc2 -n 1 caffe train --solver=examples/mnist/lenet_solver.prototxt &> examples/mnist/lenet.log\r\n"
     ]
    }
   ],
   "source": [
    "diff examples/mnist/train_lenet.sh  examples/mnist/train_lenet_th.sh "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这个文件的修改是为了使用 SLURM 作业系统，通过 yhrun 来提交计算任务，并且创建了日志文件 examples/mnist/lenet.log 。\n",
    "\n",
    "**另外， 由于是使用 CPU 来计算 ， 需要修改 examples/mnist/lenet_solver.prototxt  中的solver_mode: GPU 为 solver_mode: CPU； 这里不再展示。**\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Submitted batch job 383308\r\n"
     ]
    }
   ],
   "source": [
    "yhbatch -p nsfc2 examples/mnist/train_lenet_th.sh "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里可以直接执行此脚本，不过最好还是用yhbatch 提交作业"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "             JOBID PARTITION     NAME         USER ST       TIME  NODES NODELIST(REASON)\r\n",
      "            383308     nsfc2 train_le nscc-gz_jian  R       0:17      1 cn8035\r\n"
     ]
    }
   ],
   "source": [
    "yhqueue"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "作业已经在运行了，可以用tail 查看log (加上 -f 参数可以连续查看，但演示环境不允许)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I0520 16:29:50.440232 19425 sgd_solver.cpp:106] Iteration 1400, lr = 0.00906403\r\n",
      "I0520 16:29:55.268719 19425 solver.cpp:337] Iteration 1500, Testing net (#0)\r\n",
      "I0520 16:29:58.517141 19425 solver.cpp:404]     Test net output #0: accuracy = 0.9833\r\n",
      "I0520 16:29:58.517194 19425 solver.cpp:404]     Test net output #1: loss = 0.0494627 (* 1 = 0.0494627 loss)\r\n",
      "I0520 16:29:58.576706 19425 solver.cpp:228] Iteration 1500, loss = 0.0795374\r\n",
      "I0520 16:29:58.576748 19425 solver.cpp:244]     Train net output #0: loss = 0.0795374 (* 1 = 0.0795374 loss)\r\n",
      "I0520 16:29:58.576759 19425 sgd_solver.cpp:106] Iteration 1500, lr = 0.00900485\r\n",
      "I0520 16:30:03.441340 19425 solver.cpp:228] Iteration 1600, loss = 0.115878\r\n",
      "I0520 16:30:03.441392 19425 solver.cpp:244]     Train net output #0: loss = 0.115878 (* 1 = 0.115878 loss)\r\n",
      "I0520 16:30:03.441401 19425 sgd_solver.cpp:106] Iteration 1600, lr = 0.00894657\r\n"
     ]
    }
   ],
   "source": [
    "tail examples/mnist/lenet.log "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "运算一切正常，本演示结束 。 \n",
    "\n",
    "笔者目前对ML/DL 颇感兴趣 ， 欢迎合作交流。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Bash",
   "language": "bash",
   "name": "bash"
  },
  "language_info": {
   "codemirror_mode": "shell",
   "file_extension": ".sh",
   "mimetype": "text/x-sh",
   "name": "bash"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
